ALIWEB - Archie-like Indexing in the WEB

Author

  • Martijn Koster
Abstract

ALIWEB is a framework for the automatic collection and processing of resource indices in the World Wide Web. The current ALIWEB implementation regularly retrieves index files from many servers in the Web and combines them into a single searchable database. Using existing Web protocols and a simple index file format, server administrators can have descriptive and up-to-date information about their services incorporated into the ALIWEB database with little effort. As the indices are single files there is little overhead in the collection process, and because the files are prepared for the purpose the resulting database is of high quality. This paper discusses the background and objectives of ALIWEB and gives an overview of its functionality and implementation, comparing it to other existing resource directories in the Web. It reviews the experiences of the first months of operation, and suggests possible future directions for ALIWEB. This paper is itself available in the Web.

INTRODUCTION

The World Wide Web is a very exciting technology for deploying information on the Internet. In the last year the Web has grown both in terms of its use and the amount of published information. As a side effect of this expansion it is becoming increasingly difficult to find things in the Web. This problem is known as resource discovery, and it occurs in any large information system. This paper investigates current methods for discovering resources in the Web, and presents ALIWEB, a framework to aid resource discovery in the Web.

OVERVIEW OF EXISTING METHODS

This section gives an overview of how the methods of resource discovery in the Web evolved, from the start of the World Wide Web project to the present situation.

Browsing

When there were only a few servers one could browse through the list of servers maintained at CERN, and browse server home pages looking for up-to-date information. Now the number of servers prohibits this: it would take far too long to browse through all the home pages.

Listing

A number of people started to maintain lists of references to resources on the Web, such as the HCC NetServices list and the NCSA Meta Index. This meant users didn't have to visit all the server home pages, but could browse through an index. However, because these lists contained references to documents and services beyond the control of the index maintainer, references became stale as documents and services moved or were revoked. Also, new information would only be added to the index if and when the maintainer found out about it and had the time to do so. Finally, as the number of resources on these lists grew it became cumbersome to look through them to locate a specific resource. The author himself maintained such an index, arranged by subject, but eventually gave up: the manual maintenance became too time consuming, and the index was not representative enough of the wealth of resources in the Web.

Searching

A solution to the problem that the index documents became too large to handle is to turn them into searchable databases, such as The GNA Meta Library. Such a database still suffers from the problems of manual maintenance: it is time-consuming, and the information becomes out of date.

Automatic Collection

One of the latest searchable catalogs is the CUI W3 Catalog, which is based on automatic retrieval of a fixed set of documents that it has been specifically programmed to parse.
As one of the documents it uses is the NCSA What's New list, the CUI W3 Catalog is very comprehensive and up-to-date. It doesn't solve all problems, though: references still go out of date, there is duplication, and new sources have to be added by hand. In addition it introduces a new problem: it is difficult to isolate information items with relevant context from some of the source documents it uses, as they are not in a single defined format.

Automatic Discovery

There are indexers that take automation to the limit. It is possible to write programs that traverse ('walk') the World Wide Web, analysing and/or storing the documents encountered. These 'web-walkers' or 'spiders' have the advantage that they are very thorough, and can potentially visit all browseable parts of the Web. They are used for many different purposes: to find stale references, to discover new servers, to estimate the size of the Web, to query databases for information, and to index the Web (e.g. the JumpStation and the RBSE's URL database; see the Robots Page for others).

These systems too have problems. In the first place, spiders are potentially dangerous animals. As they retrieve many documents, they have considerable network overhead, especially when several robots operate simultaneously. They can request documents in such quick succession that the server becomes overloaded (in a local test the number of documents a single spider requested from a lightly loaded server averaged 3 per second). These problems regularly give rise to heated debates about robots. Currently there is an effort underway to prevent some of these problems: there are Guidelines for Robot Writers and there is a Proposal for Robot Exclusion (a sketch of an exclusion file appears at the end of this section). Another problem is the actual processing of documents. Because spiders come across plenty of documents it is very hard, if not impossible, for them to derive a sensible context for an information item.

However, the most important problem is that they retrieve all documents that can be reached, even those that aren't interesting or suitable to index. For example, the Hypertext Mac Archive contains 5000 files, indexed into 106 folders, and allows searching for filenames and descriptions. If someone is looking for a Macintosh file, they can do so easily and quickly by browsing or searching from the Hypertext Mac Archive welcome screen. In contrast, a spider will not only retrieve the welcome screen, but all 106 folder indices as well. Not only does this mean the resulting database of documents is very big, it also means that any structure between documents is lost.
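The exclusion proposal referred to above amounts to a simple convention: a server administrator publishes a plain-text file at a well-known location on the server, listing the parts of the server that robots are asked not to retrieve. The following is a minimal sketch; the paths are hypothetical, and only the User-agent and Disallow fields are taken from the proposal:

    # /robots.txt -- hypothetical exclusion file for one server
    # Any robot ("*") is asked not to retrieve documents under these paths.
    User-agent: *
    Disallow: /mac-archive/folders/
    Disallow: /tmp/

A spider that honours such a file could still index the archive's welcome screen while leaving the individual folder indices to the archive's own search facility.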
Parallels in Other Information Systems

These problems of resource location in a distributed environment are by no means new. Two other global information systems that have encountered this problem, and come up with solutions, are the Anonymous FTP sites and the Gopher servers.

The Anonymous FTP sites have Archie, an automatic cataloguing system that periodically retrieves listings of filenames from Anonymous FTP sites, via Anonymous FTP (Deutsch, P. & Emtage, A., "The archie System: An Internet Electronic Directory Service", ConneXions, Vol. 6, No. 2, February 1992). These listings are combined into a searchable database, which can be accessed from the Internet via telnet, special clients, and gateways in other information systems, including the Web. Archie has been very popular, as it is very effective for finding the kind of files people look for on FTP sites: related files are usually combined in tar files, which gives a grouping, and placed in directories with descriptive names, which gives context to the search results.

The Gopher servers have a similar system called Veronica, which traverses Gopherspace, indexes Gopher menus, and provides a Gopher interface for searching the resulting database. Although Veronica is also popular it is not always effective for finding things, because there is no grouping of related Gopher menu items, so you can easily get very many matches. There is also no structural information with the Gopher menu items, which means there is little context on which to select items.

Both these systems have proved extremely useful in their environments, and their success is in large part due to the fact that they operate automatically. For example, in Archie, once an Anonymous FTP site has registered its desire to be included, no further bilateral effort is required to update the information.

All these approaches to resource discovery in the World Wide Web have their strengths and weaknesses. Even though all of them are useful, none is so universal or effective that there is no need for other solutions. This situation has prompted the design and implementation of ALIWEB.

THE ALIWEB FRAMEWORK

This section outlines the objectives, design, and implementation of ALIWEB.

Objectives

ALIWEB aims to combine the successful features of current resource location strategies in the Web, while minimising their disadvantages. Specifically, the objectives of ALIWEB are:

  • To reduce the effort required to maintain an index.
  • To reduce the effort required to search the index.
  • To place low requirements on infrastructure.
  • To help towards future systems.

What ALIWEB doesn't attempt to do is:

  • Pretend to solve the Internet resource discovery problem.
  • Make other searchable indices obsolete.
  • Provide the only or best implementation.

Design

This section discusses the design of the ALIWEB model.

Decisions

To address the objectives a number of design decisions are of importance.

To reduce the effort required to maintain an index. The Browsing, Listing and Searching methods have shown that a manually maintained central database requires a lot of effort, and is not practical on a large scale. It is preferable to index in a distributed manner. To avoid extra effort in developing, learning, installing and maintaining new systems it is attractive to use the normal WWW protocols and mechanisms for this distributed indexing. In a large distributed system scalability needs to be considered, especially in the light of the recent growth of the Web. This implies open and extendible mechanisms.

To reduce the effort required to search the index. The Browsing and Listing methods have shown that letting the user wade through large amounts of indexed information is not very attractive. The searchable databases discussed under the Searching and Automatic Collection methods have proven to be more user friendly. To prevent wasting effort following stale or duplicate references it is important that the index is up-to-date and has no duplicate entries.
This can be achieved by letting the information providers manage their own index, rather than relying on third parties. To make the searching of a large index fruitful it is important to have a provision for extra meta-information that can be used to narrow down search results, such as keywords and descriptions. It is also important that information providers themselves can decide what material should or should not be indexed, to eliminate irrelevant information. Both these requirements make it easier to retain the context and structure of resources.

To place low requirements on infrastructure. This requirement rules out the use of Web-traversing robots, as the overhead of retrieving or even checking every document on a Web server is very high, both in terms of network traffic and server load, and unacceptable for many server administrators. It is also another reason to use standard WWW methods rather than invent new ones that place an additional requirement on the infrastructure.

To help towards future systems. As is the “way of the Internet”, ALIWEB aims to provide a simple solution to a simple problem, not a solution that attempts to solve the entire resource discovery problem. Other systems exist and more will come into place, and it is important to make sure that ALIWEB is as flexible as possible to satisfy future requirements, and that its information can be used by other systems. It is to be hoped that the experience with the ALIWEB pilot can serve as input to discussions about the resource discovery problem in the Web.

Overall Architecture

Summing up the design decisions, we need a distributed indexing system where information providers have control over the indexing of their own information only, an automatic method of combining these index files, and a searchable database for searching the information. This architecture of ALIWEB is very similar to that of Archie, hence the name Archie-Like Indexing in the WEB. One site retrieves index files from servers in the Web, combines them into a database, and allows the database to be searched (see Figure 1). However, there are a number of differences, explained in a later section.
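To make the index files concrete, the sketch below shows what one might look like, assuming the IAFA-style attribute-value templates used by the ALIWEB pilot; the titles, URIs, descriptions and keywords are invented for illustration:

    # Hypothetical index file made available by one server for ALIWEB to collect.
    # Each template describes one resource; a blank line separates templates.
    Template-Type: SERVICE
    Title:         Hypertext Mac Archive
    URI:           /mac-archive/
    Description:   Browse or search 5000 Macintosh files indexed into 106 folders.
    Keywords:      Macintosh, software, archive

    Template-Type: DOCUMENT
    Title:         Guidelines for Robot Writers
    URI:           /robots/guidelines.html
    Description:   Suggestions for writing well-behaved Web-traversing robots.
    Keywords:      robots, spiders, guidelines

Because the information provider writes the descriptions and keywords, the meta-information the design calls for travels with the index itself, and only this single file needs to be retrieved from each server.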
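The collection and search side can then be pictured as a small periodic job: fetch each registered server's index file over HTTP, parse the templates, and merge the records into a single searchable database. The sketch below is illustrative only and is written in Python rather than being the pilot's implementation; the server list, the index file location and the field names are assumptions carried over from the example above.

    """Illustrative sketch of ALIWEB-style collection and search (not the pilot's code)."""
    from urllib.request import urlopen

    # Hypothetical list of registered servers; the real service keeps its own registry.
    SERVERS = ["http://www.example.org", "http://archive.example.edu"]
    INDEX_PATH = "/site.idx"  # assumed location of a server's index file


    def parse_templates(text):
        """Split an index file into records of attribute-value pairs."""
        records = []
        for block in text.split("\n\n"):
            fields = {}
            for line in block.splitlines():
                if ":" in line and not line.startswith("#"):
                    name, _, value = line.partition(":")
                    fields[name.strip()] = value.strip()
            if fields:
                records.append(fields)
        return records


    def collect():
        """Fetch every server's index file and combine the records into one list."""
        database = []
        for server in SERVERS:
            try:
                with urlopen(server + INDEX_PATH) as response:
                    text = response.read().decode("utf-8", errors="replace")
            except OSError:
                continue  # unreachable servers are skipped for this run
            for record in parse_templates(text):
                record["Source"] = server  # remember which server the entry came from
                database.append(record)
        return database


    def search(database, term):
        """Return records whose title, description or keywords mention the term."""
        term = term.lower()
        return [r for r in database
                if any(term in r.get(field, "").lower()
                       for field in ("Title", "Description", "Keywords"))]


    if __name__ == "__main__":
        db = collect()
        for hit in search(db, "macintosh"):
            print(hit.get("Title"), "->", hit.get("Source", "") + hit.get("URI", ""))

Run periodically, such a job keeps the combined database no staler than the collection interval; a server that cannot be reached is simply skipped for that run.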

Related articles

Indexing the World

The World Wide Web provides readers with a potentially effortless way to retrieve and view related materials from around the world. Although readers can reach many documents by stepping through explicit links, there is no convention to show what other related documents exist. Readers must know addresses of appropriate starting points for exploration. Indices (such as Jumpstation and ALIWEB) and...

Full text

Assessing the indexing status of Iran's approved Latin medical science journals in reputable international indexes

Background and Aim: Today, journals are one of the main platforms for exchanging information between researchers. This study aimed to assess the indexing status of the approved Latin medical science journals in the Web of Science and Scopus citation databases. Materials and Methods: This study was a cross-sectional descriptive survey. The statistical population of the study was 83 titles ...

Full text

Hidden Web Indexing Using HDDI Framework

There are various methods of indexing the hidden web database, such as novel indexing, distributed indexing, or indexing using a map-reduce framework. Our goal is to find an optimized indexing technique keeping in mind various factors like searching, distributed databases, updating of the web, etc. Here, we propose an optimized method for indexing the hidden web database. This research uses Hierarchical...

Full text

Towards Better Integration of Dynamic Search Technology and the World-Wide Web

Most World-Wide Web (WWW) sites make minimal use of information retrieval (IR) technology. At best they start with a set of HTML documents and index them with WAIS, a fast but simple information retrieval engine. Users browsing these sites have the option of doing a keyword search of the database. We are building new WWW server software that: • Uses natural language processing (NLP) based retri...

Full text

Measurement of Archie Parameters of Some Carbonate Cores at Full Reservoir Conditions (Short Communication)

Application of the Archie equation in carbonate reservoirs is not easy due to the high dependence of its parameters on carbonate characteristics. Carbonates are very heterogeneous in nature, and hydrocarbon reserve estimation in these mostly oil-wet and intermediate-wet reservoirs is highly influenced by the input values of the saturation exponent. To our knowledge, non-representative oils have been us...

Full text

Journal:
  • Computer Networks and ISDN Systems

Volume 27, Issue -

Pages -

Publication date: 1994